Minimax Regret Bounds for Reinforcement Learning

Authors

  • Mohammad Gheshlaghi Azar
  • Ian Osband
  • Rémi Munos
Abstract

We consider the problem of provably optimal exploration in reinforcement learning for finite-horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of Õ(√(HSAT) + H²S²A + H√T), where H is the time horizon, S the number of states, A the number of actions, and T the number of time steps. This result improves over the best previously known bound Õ(HS√(AT)), achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when T ≥ H³S³A and SA ≥ H, the bound reduces to Õ(√(HSAT)), which matches the established lower bound of Ω(√(HSAT)) up to a logarithmic factor. Our analysis contains two key insights. We carefully apply concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in S), and we define Bernstein-based "exploration bonuses" that use the empirical variance of the estimated values at the next states (to improve scaling in H).
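
The optimistic value-iteration step described in the abstract can be sketched in a few lines. The following Python fragment is a minimal illustration, not the authors' code: it assumes known mean rewards R in [0, 1], empirical transition estimates P_hat and visit counts N built from past data, and uses illustrative constants in the Bernstein-style bonus.

    import numpy as np

    def optimistic_value_iteration(P_hat, R, N, H, T, delta=0.05):
        # P_hat: (S, A, S) empirical transition probabilities
        # R:     (S, A) mean rewards in [0, 1], assumed known here
        # N:     (S, A) visit counts; H: horizon; T: total time steps
        S, A, _ = P_hat.shape
        log_term = np.log(5 * S * A * T / delta)   # illustrative log factor
        n = np.maximum(N, 1)
        Q = np.zeros((H + 1, S, A))
        V = np.zeros((H + 1, S))
        for h in range(H - 1, -1, -1):
            EV = P_hat @ V[h + 1]                  # (S, A): empirical E[V_{h+1}]
            # Bernstein-style bonus from the empirical variance of the
            # estimated next-state values (the abstract's key H-scaling idea),
            # instead of a worst-case Hoeffding bonus of order H / sqrt(N).
            var_V = np.maximum(P_hat @ V[h + 1] ** 2 - EV ** 2, 0.0)
            bonus = np.sqrt(2 * log_term * var_V / n) + 7 * H * log_term / (3 * n)
            Q[h] = np.minimum(R + EV + bonus, H)   # optimistic Q, clipped at H
            V[h] = Q[h].max(axis=1)
        return Q[:H].argmax(axis=2)                # greedy optimistic policy, (H, S)

The variance-dependent bonus is what drives the improved scaling in H: when the estimated next-state values concentrate, the bonus shrinks well below the worst-case H/√N term.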


Similar articles

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique for obtaining an anytime algorithm from any non-anytime algorithm is the "Doubling Trick". In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences...
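
As a concrete illustration of the technique (a generic sketch under assumed interfaces, not code from the paper): the doubling trick runs the base algorithm in phases of geometrically growing length, restarting a fresh instance tuned to each phase length.

    def doubling_trick(make_algorithm, pull_arm, total_rounds):
        # make_algorithm(T_i): hypothetical factory returning a fresh bandit
        # algorithm tuned for a known horizon T_i (e.g., a UCB variant).
        # pull_arm(arm): hypothetical environment call returning a reward.
        t, i = 0, 0
        while t < total_rounds:
            T_i = 2 ** i                           # phase i lasts 2^i rounds
            algo = make_algorithm(T_i)             # restart, tuned to T_i
            for _ in range(min(T_i, total_rounds - t)):
                arm = algo.select_arm()
                algo.update(arm, pull_arm(arm))
                t += 1
            i += 1

With geometric phase lengths T_i = 2^i, a base algorithm with √T regret keeps its √T rate up to a constant factor, which is why this trick is the standard route to anytime guarantees.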


Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making the local variance of the bias function appear in place of th...
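
For reference, the regret notion in this undiscounted, average-reward setting is conventionally defined as follows (standard notation, not taken from the truncated abstract):

    \[
      \Delta(T) \;=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t,
    \]

where \(\rho^{*}\) is the optimal average reward (gain) of the MDP and \(r_t\) is the reward collected at step t of the single stream of interaction.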


Optimal Non-Asymptotic Lower Bound on the Minimax Regret of Learning with Expert Advice

We prove non-asymptotic lower bounds on the expectation of the maximum of d independent Gaussian variables and the expectation of the maximum of d independent symmetric random walks. Both lower bounds recover the optimal leading constant in the limit. A simple application of the lower bound for random walks is an (asymptotically optimal) non-asymptotic lower bound on the minimax regret of onlin...
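
For context, these results sharpen a classical asymptotic fact (stated here in our notation, as a standard result rather than a claim of the paper):

    \[
      \lim_{d \to \infty} \frac{\mathbb{E}\left[\max_{1 \le i \le d} X_i\right]}{\sqrt{2 \ln d}} \;=\; 1
      \qquad \text{for } X_1, \dots, X_d \ \text{i.i.d.}\ \mathcal{N}(0, 1),
    \]

and the minimax regret of prediction with d experts over T rounds is known to grow as \(\sqrt{(T/2)\ln d}\) asymptotically; the paper's contribution is lower bounds of this type that hold for finite d and T while recovering the optimal leading constant.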


Partial Monitoring - Classification, Regret Bounds, and Algorithms

In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed acti...
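
The game protocol described here is easy to pin down in code. The sketch below is illustrative only (the learner interface and matrix encoding are our assumptions): losses L[a, o] and feedback signals F[a, o] are fixed matrices, and the learner sees only the feedback, never the loss or the outcome.

    import numpy as np

    def run_partial_monitoring(L, F, learner, outcomes):
        # L[a, o]: loss of action a against outcome o (a fixed function)
        # F[a, o]: feedback signal for (a, o) (a fixed function)
        # learner: hypothetical object with select() and observe() methods
        total_loss = 0.0
        per_action = np.zeros(L.shape[0])   # cumulative loss of each fixed action
        for o in outcomes:
            a = learner.select()
            total_loss += L[a, o]           # suffered, but never revealed
            learner.observe(a, F[a, o])     # only the feedback is revealed
            per_action += L[:, o]
        return total_loss - per_action.min()  # regret vs. best fixed action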


Efficient Policy Learning

We consider the problem of using observational data to learn treatment assignment policies that satisfy certain constraints specified by a practitioner, such as budget, fairness, or functional form constraints. This problem has previously been studied in economics, statistics, and computer science, and several regret-consistent methods have been proposed. However, several key analytical compone...




Publication date: 2017